PYTHON-5536 Avoid clearing the connection pool when the server connection rate limiter triggers#2509
Conversation
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…b#2507) Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ction rate limiter triggers
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Steven Silvester <steve.silvester@mongodb.com>
pymongo/asynchronous/pool.py
Outdated
| conn.conn.get_conn.read(1) | ||
| except Exception as _: | ||
| # TODO: verify the exception | ||
| close_conn = False |
There was a problem hiding this comment.
2 comments:
- I believe this logic needs to move to connection checkout. Here in connection check in we already know the connection is useable because we're checking it back in after a successful command.
- Instead of a 1ms read can we reuse the existing _perished() + conn_closed() methods?
(cherry picked from commit 0d4c84e)
|
Ideally we'd like to detect |
| else: | ||
| if self._closing_exception: | ||
| raise self._closing_exception | ||
| if self._closed.done(): |
There was a problem hiding this comment.
Is calling is_closing here better? It'll catch more edge cases in theory.
There was a problem hiding this comment.
No, it is ambiguous as to whether connection_lost as been called yet. Since connection_lost is synchronous, checking for self._closed.done() assures that we have actually lost the connection.
pymongo/asynchronous/pool.py
Outdated
| ): | ||
| self._backoff += 1 | ||
| error._add_error_label("SystemOverloaded") | ||
| error._add_error_label("Retryable") |
There was a problem hiding this comment.
Could you merge backpressure? Originally I added the incorrect labels here. It should be "SystemOverloadedError" and "RetryableError"
pymongo/asynchronous/pool.py
Outdated
| self._backoff += 1 | ||
| error._add_error_label("SystemOverloaded") | ||
| error._add_error_label("Retryable") | ||
| print(f"Setting backoff in {phase}:", self._backoff) # noqa: T201 |
There was a problem hiding this comment.
Instead of inspecting the error message after the fact, is it possible we can record some state to determine if the error happened during DNS+TCP or after? Like:
# Assume all non dns/tcp/timeout errors mean the server rejected the connection due to overload.
if not errorDuringDnsTcp and not timeoutError:
error._add_error_label("SystemOverloadedError")
There was a problem hiding this comment.
DNS is already resolved by the time we make a pool as far as I can tell, and we can't distinguish between TCP connection and TLS handshake for async.
There was a problem hiding this comment.
I added some logic in 7548f7b that looks for a specific error attached to AutoReconnect.
There was a problem hiding this comment.
For discussion: we can still run into the condition were we hit this line, and there is no closing error, and we have not yet received any data on the protocol. I verified this by setting a flag when buffer_updated is called. We don't have a way to ascribe more semantic meaning to this condition as far as I can tell from the Protocol/Transport docs and the available properties on each.
There was a problem hiding this comment.
The problem got even worse when using gevent, since the error was something completely different, so I reverted any extra handling of the connection error.
There was a problem hiding this comment.
After today's discussion I added 6a47fc6 to only add the labels if we have completed DNS+TCP connection.
pymongo/asynchronous/pool.py
Outdated
| self._backoff += 1 | ||
| error._add_error_label("SystemOverloadedError") | ||
| error._add_error_label("RetryableError") | ||
| print(f"Setting backoff in {phase}:", self._backoff) # noqa: T201 |
There was a problem hiding this comment.
Could we remove this print now?
There was a problem hiding this comment.
Do we want to log it somehow? In the Server Tech Design they call out "A message will be logged and the function will be returned from."
There was a problem hiding this comment.
Yes we'll add a CMAP log message and/or an a new event. Feel free to add a log message now or we can defer to the spec PR.
| self._handle_connection_error(error, "handshake") | ||
| if isinstance(error, (IOError, OSError, *SSLErrors)): | ||
| details = _get_timeout_details(self.opts) | ||
| _raise_connection_failure(self.address, error, timeout_details=details) |
There was a problem hiding this comment.
_raise_connection_failure will raise a new exception in which case the error labels won't be propagated. Is that a bug? I guess it's not because _raise_connection_failure only handles IOError, OSError, *SSLErrors while _handle_connection_error only handles AutoReconnect, right?
There was a problem hiding this comment.
Correct, they are orthogonal
|
Looks like there's still a few load balancer spec tests that fail due to the new behavior: https://spruce.mongodb.com/task/mongo_python_driver_backpressure_load_balancer_test_non_standard_latest_python3.14_auth_ssl_sharded_cluster_patch_c458379522a3ad0ff0cb577aea1f9fe22efd1bfe_68d32dc8fea2ae0007bdadfb_25_09_23_23_31_22/tests?execution=0&sorts=STATUS%3AASC |
…tion rate limiter triggers (#2509) Co-authored-by: Iris <58442094+sleepyStick@users.noreply.github.com> Co-authored-by: Noah Stapp <noah.stapp@mongodb.com> Co-authored-by: Shane Harvey <shnhrv@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> PYTHON-5629 Increase max overload retries from 3 to 5 and initial delay from 50ms to 100ms (#2599) PYTHON-5517 Simplify pool backpressure behavior (#2611) synchro update network_layer update pool shared update pool shared update run-tests
…tion rate limiter triggers (mongodb#2509) Co-authored-by: Iris <58442094+sleepyStick@users.noreply.github.com> Co-authored-by: Noah Stapp <noah.stapp@mongodb.com> Co-authored-by: Shane Harvey <shnhrv@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…r connection rate limiter triggers (mongodb#2509)" This reverts commit a99645e.
Currently testing with this script for async:
and this one for sync: